Data Science

Review & Future Directions

2014-04-07
Instructor: Alessandro Gagliardi
TA: Kevin Perko

Agenda

Review
- What we covered
  - Data
  - Science
- What we didn't
  - Time Series Analysis
  - Network Analysis
Future Directions
- Types of Data Science
  - ~~4, 5,~~ 8 types of data scientists
- From Hacker to Operator
  - Experience
  - Mentorship

Review

What we covered:

Data

Big Data
- Hadoop
- IPython.parallel & StarCluster
APIs
- Twitter
- JSON
Relational Databases
- SQL
- 1st, 2nd, and 3rd Normal Form
Feature Vectors
- Data Frames
- Term-Document Matrices
Visualization
- ggplot2

Science

Regression
- Linear Regression
Classification
- Logistic Regression
- k-Nearest Neighbors
- Decision Trees
- Artificial Neural Networks
- Support Vector Classifiers
Dimensionality Reduction
- Principal Component Analysis
Clustering
- k-Means Clustering

What we didn't:

Pig
Hive, Impala/Presto
Spark, Shark
Natural Language Processing
Time Series Analysis
Network (i.e. Graph) Analysis
Interactive Visualization (e.g. D3.js)
...

Briefly:

Time Series Analysis

The main difference between time series analysis and other forms of analysis is that the instances are not independent of one another. (e.g. Sales on day $t$ may be related to sales on day $t-1$, while sales by clerk $x$ are (hopefully) independent of sales by clerk $x-1$.)

Autoregression

The notation AR($p$) refers to the autoregressive model of order $p$. The AR($p$) model is written

$$ X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t .\, $$

where $\varphi_1, \ldots, \varphi_p$ are parameters, $c$ is a constant, and the random variable $\varepsilon_t$ is white noise.
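As a rough illustration of the AR($p$) recursion above, here is a minimal sketch in plain Python. The function names (`simulate_ar`, `lag1_autocorrelation`), the Gaussian noise, and the choice to treat values before $t=0$ as zero are assumptions made for this example, not part of the definition.

```python
import random

def simulate_ar(phi, c=0.0, n=200, sigma=1.0, seed=0):
    """Simulate X_t = c + sum_i phi_i * X_{t-i} + eps_t.

    phi is the list [phi_1, ..., phi_p]; values before t=0 are treated as 0
    (an assumption of this sketch). eps_t is Gaussian white noise.
    """
    rng = random.Random(seed)
    x = []
    for t in range(n):
        # Only use lags that already exist in the simulated series.
        ar_part = sum(phi[i] * x[t - 1 - i]
                      for i in range(len(phi)) if t - 1 - i >= 0)
        x.append(c + ar_part + rng.gauss(0.0, sigma))
    return x

def lag1_autocorrelation(x):
    """Sample correlation between x_t and x_{t-1}."""
    mean = sum(x) / len(x)
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

if __name__ == "__main__":
    series = simulate_ar([0.8], n=5000, seed=1)   # AR(1) with phi_1 = 0.8
    noise = simulate_ar([], n=5000, seed=1)       # p = 0: plain white noise
    print("AR(1) lag-1 autocorrelation:", round(lag1_autocorrelation(series), 2))
    print("white-noise lag-1 autocorrelation:", round(lag1_autocorrelation(noise), 2))
```

For an AR(1) series the lag-1 autocorrelation should come out near $\varphi_1$, while for white noise it should be near zero. That is the "instances are not independent" property in numeric form.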

Moving Average

The notation MA($q$) refers to the moving average model of order $q$:

$$ X_t = \mu + \varepsilon_t + \sum_{i=1}^q \theta_i \varepsilon_{t-i}\, $$

where $\theta_1, \ldots, \theta_q$ are the parameters of the model, $\mu$ is the expectation of $X_t$ (often assumed to equal 0), and $\varepsilon_t, \varepsilon_{t-1}, \ldots$ are, again, white noise error terms.

The moving-average model is essentially a finite impulse response filter with some additional interpretation placed on it.
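One way to see the filter interpretation is to feed a single unit "impulse" of noise through the MA($q$) formula and watch the response die out after $q$ steps. The sketch below assumes noise terms before $t=0$ are zero; the function name `moving_average_process` is made up for this example.

```python
def moving_average_process(eps, theta, mu=0.0):
    """Compute X_t = mu + eps_t + sum_i theta_i * eps_{t-i} for a given noise sequence.

    theta is the list [theta_1, ..., theta_q]; noise before t=0 is treated as 0
    (an assumption of this sketch).
    """
    # FIR coefficients: weight 1 on eps_t, weight theta_i on eps_{t-i}.
    weights = [1.0] + list(theta)
    x = []
    for t in range(len(eps)):
        x.append(mu + sum(w * eps[t - k]
                          for k, w in enumerate(weights) if t - k >= 0))
    return x

if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    # MA(2): the response is nonzero for exactly q + 1 = 3 steps, then stops.
    print(moving_average_process(impulse, theta=[0.5, 0.25]))
    # -> [1.0, 0.5, 0.25, 0.0, 0.0]
```

The response is exactly the coefficient list followed by zeros, which is what "finite impulse response" means.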

ARMA model

The notation ARMA($p, q$) refers to the model with $p$ autoregressive terms and $q$ moving-average terms. This model contains the AR($p$) and MA($q$) models as special cases:

$$ X_t = c + \varepsilon_t + \sum_{i=1}^p \varphi_i X_{t-i} + \sum_{i=1}^q \theta_i \varepsilon_{t-i}.\,$$

*(from [Wikipedia](http://en.wikipedia.org/wiki/Autoregressive_moving_average))*
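The following sketch (not from Wikipedia) applies the ARMA($p, q$) recursion to a noise sequence supplied by the caller, so the output is deterministic. The function name `arma` and the convention that values before $t=0$ are zero are assumptions of this example. Passing an empty `theta` gives a pure AR model and an empty `phi` gives a pure MA model.

```python
def arma(eps, phi=(), theta=(), c=0.0):
    """Apply X_t = c + eps_t + sum_i phi_i X_{t-i} + sum_i theta_i eps_{t-i}.

    eps is the noise sequence; X and eps before t=0 are treated as 0
    (an assumption of this sketch).
    """
    x = []
    for t in range(len(eps)):
        ar = sum(phi[i] * x[t - 1 - i]
                 for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[i] * eps[t - 1 - i]
                 for i in range(len(theta)) if t - 1 - i >= 0)
        x.append(c + eps[t] + ar + ma)
    return x

if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print("AR(1):    ", arma(impulse, phi=[0.5]))                # decays forever
    print("MA(1):    ", arma(impulse, theta=[0.5]))              # stops after q steps
    print("ARMA(1,1):", arma(impulse, phi=[0.5], theta=[0.5]))
```

With a unit impulse as input, the AR part produces a response that decays geometrically, while the MA part contributes only for $q$ steps.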

Extensions include:

ARIMA - Autoregressive integrated moving average
ARMAX - Autoregressive–moving-average with exogenous inputs

Other time series methods include:

Time Series Principal Component Analysis

Briefly:

Network Analysis

Like text mining, network (or graph) analysis involves big, sparse matrices.

Graphs contain:

Nodes or Vertices
Links or Edges

Graphs can be:

Directed
Undirected
- Undirected graphs are directed graphs where all edges are reciprocal

Directed graphs can be:

Cyclic
Acyclic

(A tree is an example of a directed acyclic graph, or DAG)

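As an adjacency matrix, a graph becomes (upper triangle shown; rows and columns are node numbers, 1 marks an edge):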

	2	3	4	5	6
1	1	0	0	1	0
2		1	0	1	0
3			1	0	0
4				1	1
5					0
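The matrix above can be rebuilt from an edge list. In the sketch below, the edge list is read off the 1-entries of the table. The variable names, and the choice to also store the graph as a dictionary of neighbor sets (a common sparse representation), are assumptions of this example.

```python
# Undirected edges read off the 1-entries of the upper-triangular table.
edges = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]
n = 6  # nodes are numbered 1..6

# Dense representation: a full n x n adjacency matrix (symmetric, since undirected).
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[u - 1][v - 1] = 1
    adjacency[v - 1][u - 1] = 1

# Sparse representation: store only the neighbors that actually exist.
neighbors = {node: set() for node in range(1, n + 1)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

if __name__ == "__main__":
    for row in adjacency:
        print(*row)
    nonzero = sum(map(sum, adjacency))
    print(f"{nonzero} of {n * n} matrix entries are nonzero")
```

Even in this tiny graph most entries are zero. In a large network the fraction of nonzero entries is typically far smaller, which is why sparse representations matter.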

Small-World Networks

A network where the typical distance $L$ between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes $N$ in the network, that is:

$$ L \propto \log N$$
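The typical distance $L$ can be measured directly with breadth-first search. The sketch below compares a plain ring of nodes with the same ring plus a few random shortcut edges. This is only an illustration of how shortcuts shrink $L$, not a full small-world model, and the helper names (`average_path_length`, `ring`, `add_shortcuts`) are invented for this example.

```python
import random
from collections import deque

def average_path_length(adj):
    """Mean shortest-path distance over ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search from source gives hop counts to every reachable node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def ring(n):
    """Undirected ring: node i is linked to its two neighbors i-1 and i+1 (mod n)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def add_shortcuts(adj, k, seed=0):
    """Return a copy of adj with k random extra undirected edges."""
    rng = random.Random(seed)
    nodes = list(adj)
    new = {u: set(vs) for u, vs in adj.items()}
    for _ in range(k):
        u, v = rng.sample(nodes, 2)
        new[u].add(v)
        new[v].add(u)
    return new

if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        plain = ring(n)
        shortcut = add_shortcuts(plain, k=n // 10)
        print(n, round(average_path_length(plain), 1),
              round(average_path_length(shortcut), 1))
```

On the plain ring $L$ grows linearly with $N$. With shortcuts it should grow much more slowly, which is the qualitative behavior the small-world definition describes.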

For fun:

inmaps.linkedinlabs.com
wolframalpha.com/facebook

Future Directions

Categories of data scientists

*(according to [Vincent Granville](http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists))*

Those strong in statistics: they are expert in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling and other related techniques.
Those strong in mathematics: NSA or defense/military, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization).
Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API's, Analytics as a Service, optimization of data flows, data plumbing.
Those strong in machine learning / computer science (algorithms, computational complexity)
Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
Those strong in production code development, software engineering (they know a few programming languages)
Those strong in visualization
Those strong in GIS, spatial data, data modeled by graphs, graph databases
Those strong in a few of the above.

*(according to [Tomasz Tunguz](https://www.linkedin.com/today/post/article/20131002174328-4444200-which-of-the-five-types-of-data-science-does-your-startup-need))*

Quantitative, exploratory data scientists tend to have PhDs and use theory to understand behavior. Varian’s team researches the advertiser dynamics within the ads auction and compares those dynamics to theoretical auction models like the Vickrey auction. By combining theory and exploratory research, these data scientists improve products.
Operational data scientists often work in the finance, sales or operations teams at Google. In the AdSense ops . . . a star data analyst who each week would discuss our team’s performance: our email response times, the satisfaction scores of our publishers, and changes in publisher behavior by segment. His work provided a feedback loop to improve the team’s tactics and efficiency.
Product data scientists tend to belong to product management or engineering. PMs and engineers sift through logs and analysis tools to understand the way users interact with a product and leverage that knowledge to refine the product. At Google, the ads quality team analyzed user clicks data to improve ad targeting.
Marketing data scientists segment the user base, evaluate the performance of advertising campaigns, match product features to customer segments, and design content marketing campaigns. The marketing data scientist creates awareness and leads for the sales team, helping generate revenue.
Research data scientists create insights as a product. Nate Silver is arguably the most famous of them. Silver’s work doesn’t influence a product; the analysis is the product itself. Sometimes the data science leads to a thought leadership whitepaper, or a blog post, or a financial report.

*(emphasis mine)*

*(according to [Harlan D. Harris](http://strata.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html))*

Data Businesspeople are the product and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.
Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.
Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.
Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yields valuable insights and products.

*(according to [Brendan Tierney](http://www.oralytics.com/2013/03/type-i-and-type-ii-data-scientists.html))*

The Type I Data Scientist specializes in...
- Statisticians
- Data Miners
- Predictive Modellers
- Machine Learning
- Data Warehousing
- Business Intelligence & Visualization
- Big Data
- R / Oracle / SAS / SPSS / etc.
The Type II Data Scientist approaches the types of problems that organisations are facing in a different way. They will concentrate on the business goals and business problems that the organisation is facing. Based on these they will identify what the data science project will focus on, ensuring that there is a measurable outcome and business goal. The Type II Data Scientist will be a good communicator, being able to translate between the business problem and the technical environment necessary to deliver what is needed. During the project the data science team will discover various insights about the data. The Type II Data Scientist will prioritise these and feed them back to the various business units. Some of these insights can range from something new, verifying business knowledge beliefs, areas where better data capture is needed, improvements in applications, etc.

~~Types~~ Levels of Data Science

*(according to [Steve Jones](http://service-architecture.blogspot.com/2014/03/what-are-types-of-data-scientist.html))*

Data Science Bluffers They are the people who get a spreadsheet with a bunch of data, apply a very basic statistical function and claim 'Hey its Data Science'.
Data Hackers 'the one eyed man in the kingdom of the blind'...people with a bit of skill, maybe a bit of training, but they aren't at the level of sophistication of an operator...don't mistake knowing how to apply one Machine Learning technique for actual knowledge.
Data Operators or Resident Data Scientists take predefined algorithms, statistical or machine learning, and then apply them to a specific company scenario and most crucially keep the parameters up to date so the algorithm continues to perform.
Data Magicians normally have mathematical or physics centric PhDs (often several), often focused in specific areas such as fluid dynamics, economics or super specific such as wind-turbines... The reason its science is because its testable and provable. They can show that their algorithm would have produced 5% improvement in performance over the past 5 years, and as it moves forward show how their approach has made a difference to the performance of a business.

You are probably at the "Data Hacker" level right now and should strive to become an Operator or Resident Data Scientist. (I'm at the Operator level and striving to become a Magician.)

Q. How do I go from Data Hacker to a Data Operator or Resident Data Scientist?

Experience - Challenge yourself with projects just outside of your ability and learn the techniques you need to achieve it.
Mentorship - Ideally someone in your organization invested in your growth.
- Also: the greater Data Science community

Tools to keep an eye on:

D3.js
Spark
Julia

Useful Languages (besides Python, R, and SQL)

Julia
Scala
Pig
JavaScript
Java (sorry)

Questions?